Intermediate pandas

Now that we are familiar with pandas' data structures, we can turn our attention to some of the intermediate features of data frames, which include:

  • Easy handling of missing data
  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
  • Hierarchical labeling of axes
  • Sorting and ranking of data in DataFrames
  • Data summarization tools

In this section, we will manipulate data collected from ocean-going vessels on the eastern seaboard. Vessel operations are monitored using the Automatic Identification System (AIS), a maritime safety technology that vessels are required to maintain. AIS uses transponders to transmit very high frequency (VHF) radio signals containing static information, including ship name, call sign, and country of origin, as well as dynamic information unique to a particular voyage, such as vessel location, heading, and speed.

The International Maritime Organization’s (IMO) International Convention for the Safety of Life at Sea requires functioning AIS capabilities on all vessels of 300 gross tons or greater, and the US Coast Guard requires AIS on nearly all vessels sailing in U.S. waters. The Coast Guard has established a national network of AIS receivers that covers nearly all U.S. waters. AIS signals are transmitted several times each minute, and the network is capable of handling thousands of reports per minute, with updates as often as every two seconds. A typical voyage in our study might therefore include the transmission of hundreds or thousands of AIS-encoded signals, providing a rich source of data with both spatial and temporal information.

For our purposes, we will use summarized data that describes the transit of a given vessel through a particular administrative area. The data includes the start and end time of the transit segment, as well as information about the speed of the vessel, how far it travelled, etc.


In [ ]:
import pandas as pd
import numpy as np

vessels = pd.read_csv('../data/AIS/vessel_information.csv', index_col=0)

Indexing and Selection

Indexing works analogously to indexing in NumPy arrays, except we can use the labels in the Index object to extract values in addition to arrays of integers.


In [ ]:
# Sample Series object
flag = vessels.flag
flag

In [ ]:
# NumPy-style (positional) indexing
flag[:10]

In [ ]:
# Indexing by label
flag[[298716,725011300]]

In a DataFrame we can slice along either or both axes:


In [ ]:
vessels[['num_names','num_types']]

In [ ]:
vessels[vessels.max_loa>700]

The loc indexer allows us to select subsets of rows and columns by label in an intuitive way:


In [ ]:
vessels.loc[720768000, ['names','flag', 'type']]

In [ ]:
vessels.loc[:4731, 'names']

Slicing also works with string labels, since an index has an intrinsic order, regardless of the label type:


In [ ]:
vessels.loc[:310, 'flag':'loa']

In addition to using loc to select rows and columns by label, pandas also allows indexing by position using the iloc attribute.

So, we can query rows and columns by absolute position, rather than by name:


In [ ]:
vessels.iloc[:5, 5:8]

Exercise

You can use the isin method to query a DataFrame based upon a list of values as follows:

data['color'].isin(['red', 'blue'])

Use isin to find all ships registered in either Denmark or Japan. How many records contain these values?


In [ ]:
# Write your answer here

Indexing with where

Pandas DataFrame objects also possess a where method that returns the values satisfying a given condition, but retains the index and shape of the original DataFrame. This is important when alignment is required for operations between DataFrames.


In [ ]:
np.random.seed(42)
normal_vals = pd.DataFrame({'x{}'.format(i):np.random.randn(100) for i in range(5)})

normal_vals.head()

In [ ]:
normal_vals.where(normal_vals>0).head()

where includes an optional other argument that accepts a scalar, tabular values, or a callable to replace the values in the DataFrame that do not satisfy the condition.

For example, we can use this to return the absolute values of the data:


In [ ]:
normal_vals.where(normal_vals>0, other=-normal_vals).head()
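
This is equivalent to the DataFrame's built-in abs method; a quick sketch to confirm (all columns should come back True, assuming no exact zeros in the sample):


In [ ]:
(normal_vals.where(normal_vals > 0, other=-normal_vals) == normal_vals.abs()).all()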

Similarly, a callable can be used when we need to modify the replaced value:


In [ ]:
normal_vals.where(normal_vals>0, other=lambda y: -y*100).head()

Conversely, mask is the boolean inverse of where:


In [ ]:
normal_vals.mask(normal_vals>0).head()

Selection with query

At times, selection using indexing can be verbose because it requires repeated use of the DataFrame namespace.


In [ ]:
normal_vals[(normal_vals.x1 > normal_vals.x0) & (normal_vals.x3 > normal_vals.x2)].head()

For a more concise (and readable) syntax, we can use the query method to perform selection on a DataFrame. Instead of having to type the fully-specified column, we can simply pass a string that describes what to select. The query above then becomes simply:


In [ ]:
normal_vals.query('(x1 > x0) & (x3 > x2)').head()

The DataFrame.index and DataFrame.columns are placed in the query namespace by default. If you want to refer to a variable in the current namespace, you can prefix the variable with @:


In [ ]:
min_loa = 700

In [ ]:
vessels.query('max_loa > @min_loa')
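
Because the index is also in the query namespace, we can filter on it directly. A sketch (assuming, as above, that the index holds numeric MMSI identifiers):


In [ ]:
vessels.query('index > 900000000')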

Operations

DataFrame and Series objects allow for several operations to take place either on a single object, or between two or more objects.

For this section, we will import a complementary dataset of transits (voyages) of each vessel. This is a much larger dataset, since ships each have multiple transits over several years.

For example, we can perform arithmetic on the elements of two objects, such as combining distances travelled across years. First, let's artificially construct two Series, consisting of segment lengths travelled in 2009 and 2010:


In [ ]:
segments = pd.read_csv('../data/AIS/transit_segments.csv', parse_dates=['st_time'])

In [ ]:
segments['year'] = segments.st_time.dt.year
segments.year.head()

In [ ]:
segments2009 = segments[segments.year==2009]
lengths2009 = pd.Series({name: segments2009[segments2009.name==name].seg_length.sum() 
                         for name in segments2009.name.unique()})

In [ ]:
segments2010 = segments[segments.year==2010]
lengths2010 = pd.Series({name: segments2010[segments2010.name==name].seg_length.sum() 
                         for name in segments2010.name.unique()})

We will learn a much more efficient way of performing this operation when we get to GroupBy operations a little later on.
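
As a preview, the same 2009 totals can be computed with a single groupby expression:


In [ ]:
segments[segments.year == 2009].groupby('name').seg_length.sum()

For comparison, the cell below constructs Series of the raw, per-segment lengths indexed by vessel name; note that these indexes contain repeated labels: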


In [ ]:
length2009 = segments.loc[segments.year==2009, 'seg_length']
length2009.index = segments.name[segments.year==2009]

length2010 = segments.loc[segments.year==2010, 'seg_length']
length2010.index = segments.name[segments.year==2010]

Now, let's add them together, in hopes of getting 2-year travel totals:


In [ ]:
seg_lengths = lengths2009 + lengths2010
seg_lengths

Pandas' data alignment places NaN values for labels that do not overlap in the two Series. In fact, the majority of ships only have data for one of the two years.


In [ ]:
seg_lengths.notnull().mean()

While we do want the operation to honor the data labels in this way, we probably do not want the missing values to be filled with NaN. We can use the add method to calculate ship travel totals by using the fill_value argument to insert a zero for years that do not have data for a particular vessel:


In [ ]:
lengths2009.add(lengths2010, fill_value=0)

Operations can also be broadcast between rows or columns.

For example, if we subtract the maximum LOA (ship length) from the max_loa column, we get the difference in size between each vessel and the largest vessel.


In [ ]:
vessels.max_loa - vessels.max_loa.max()

Or, looking at things row-wise, we can see how a particular vessel compares with the rest of the group with respect to important features:


In [ ]:
vessels[vessels.max_loa==vessels.max_loa.max()]

In [ ]:
recs = vessels[['num_names','num_loas', 'max_loa', 'num_types']]
diff = recs - recs.loc[354092000]
diff[:10]
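
To control the broadcast axis explicitly, the arithmetic methods (add, sub, div, and so on) accept an axis argument. A sketch that subtracts the num_names column from every column, aligning on the row index:


In [ ]:
recs.sub(recs.num_names, axis=0).head()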

We can also apply functions to each column or row of a DataFrame:


In [ ]:
recs.apply(np.median)

In [ ]:
def range_calc(x):
    return x.max() - x.min()

In [ ]:
recs.apply(range_calc)
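
Passing axis=1 applies the function across each row instead; for example, the range of values within each vessel's record:


In [ ]:
recs.apply(range_calc, axis=1).head()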

Sorting and Ranking

Pandas objects include methods for re-ordering data.


In [ ]:
segments.sort_index().head()

In [ ]:
segments.sort_index(ascending=False).head()

Try sorting the columns instead of the rows, in ascending order:


In [ ]:
segments.sort_index(axis=1).head()

We can also use sort_values to sort a Series by value, rather than by label.


In [ ]:
segments.seg_length.sort_values(ascending=False).head(10)

For a DataFrame, we can sort according to the values of one or more columns using the by argument of sort_values:


In [ ]:
segments[['avg_sog','max_sog','min_sog']].sort_values(by=['max_sog', 'min_sog'],
                                                      ascending=[False, True]).head(10)

Ranking does not re-arrange the data, but instead returns a Series of ranks, giving the rank of each value relative to the others in the Series.


In [ ]:
segments.avg_sog.rank()

Ties are assigned the mean value of the tied ranks, which may result in decimal values.


In [ ]:
pd.Series([100,100]).rank()

Alternatively, you can break ties via one of several methods, such as by the order in which they occur in the dataset:


In [ ]:
segments.avg_sog.rank(method='first').sort_values()
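
Other tie-breaking rules include 'min', 'max', and 'dense'; a small sketch on a toy Series:


In [ ]:
pd.Series([7, 5, 7, 7, 2]).rank(method='min')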

Calling the DataFrame's rank method results in the ranks of all columns:


In [ ]:
segments.rank(ascending=False).head()

Hierarchical indexing

While the mmsi (Maritime Mobile Service Identity) field represents a unique identifier in the vessels table, it does not in the segments table.

We can use hierarchical indexing, however, to build a unique index that may be more meaningful than the ordered set of integers given as a default index.


In [ ]:
segments_hier = segments.set_index(['mmsi', 'name', 'year', 'transit', 'segment'])
segments_hier.head(10)

For comparison, the original segments DataFrame still carries its default integer index, so we can draw out an arbitrary row by its integer label:


In [ ]:
np.random.randint(0, len(segments))

In [ ]:
segments.loc[np.random.randint(0, len(segments))]

This index is a MultiIndex object that consists of a sequence of tuples, the elements of which are combinations of the five columns used to create the index. Where there are multiple repeated values, pandas does not print the repeats, making it easy to identify groups of values.
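
We can inspect the underlying tuples directly:


In [ ]:
segments_hier.index[:5]

The combined index should now be unique: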


In [ ]:
segments_hier.index.is_unique

Try using this hierarchical index to retrieve the first segment of the tenth transit of the Sentinel (mmsi=366766980):


In [ ]:
segments_hier.loc[(366766980, 'Sentinel', 2009, 10, 1)]

With a hierarchical index, we can select subsets of the data based on a partial index:


In [ ]:
segments_hier.loc[9]

Hierarchical indices can be created on either or both axes. Here is a trivial example:


In [ ]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])

frame

If you want to get fancy, both the row and column indices themselves can be given names:


In [ ]:
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
frame

With this, we can do all sorts of custom indexing:


In [ ]:
frame.loc['a', 'Ohio']

Try retrieving the value corresponding to b2 in Colorado:


In [ ]:
# Write your answer here

Additionally, the order of the set of indices in a hierarchical MultiIndex can be changed by swapping them pairwise:


In [ ]:
segments_hier.swaplevel('mmsi', 'name').head()

In [ ]:
segments_hier.swaplevel('mmsi', 'name').loc['Sentinel']

Alternatively, one can index based on an arbitrary level using the xs (cross-section) method:


In [ ]:
segments_hier.xs('Sentinel', level='name')

Data can also be sorted by any index level, using sort_index with the level argument (sortlevel is the older, now-deprecated spelling):


In [ ]:
segments_hier.sort_index(level='year', ascending=False).head()

Missing data

The occurrence of missing data is so prevalent that it pays to use tools like Pandas, which seamlessly integrates missing-data handling so that missing values can be dealt with easily, in the manner required by the analysis at hand.

Missing data are represented in Series and DataFrame objects by the NaN floating point value. However, None is also treated as missing, since it is commonly used to indicate missingness elsewhere in Python.


In [ ]:
foo = pd.Series([np.nan, -3, None, 'foobar'])
foo

In [ ]:
foo.isnull()

To illustrate how to deal with missing values in pandas, we will import an educational database, which reports test scores for children with hearing disabilities, along with associated covariates, several of which have missing values.


In [ ]:
testing = pd.read_csv('../data/test_scores.csv', index_col=0)
testing.head(10)

Missing values may be dropped or indexed out:

By default, dropna drops entire rows in which one or more values are missing.


In [ ]:
testing.dropna().head(10)

A similar result can be obtained via boolean indexing, using the masks produced by isnull and notnull (note that for a DataFrame, this masks individual values rather than dropping whole rows):


In [ ]:
testing.isnull().head()

In [ ]:
testing[testing.notnull()].head()

The default behavior of dropna can be overridden by passing the how='all' argument, which only drops a row when every field is a missing value.


In [ ]:
testing.dropna(how='all')

This can be customized further via the thresh argument, which specifies the minimum number of non-missing values a row must contain in order to be kept.


In [ ]:
testing.dropna(thresh=10)

This is typically used in time series applications, where there are repeated measurements that are incomplete for some subjects.

Exercise

Try using the axis argument to drop columns with missing values:


In [ ]:
# Write your answer here

Rather than omitting missing data from an analysis, in some cases it may be suitable to fill the missing value in, either with a default value (such as zero) or a value that is either imputed or carried forward/backward from similar data points. We can do this programmatically in Pandas with the fillna method.


In [ ]:
testing.fillna(0).head(10)

In [ ]:
testing.fillna({'family_inv': 0, 'mother_hs':1}).head(10)
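
Since imputation was mentioned above, here is a minimal sketch that fills each numeric column with its column mean (numeric_only excludes the string columns):


In [ ]:
testing.fillna(testing.mean(numeric_only=True)).head(10)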

Notice that fillna by default returns a new object with the desired filling behavior, rather than changing the Series or DataFrame in place (in general, we prefer this non-destructive behavior!).

Missing values can also be filled from adjacent values, using any one of several methods, such as a backfill:


In [ ]:
testing.fillna(method='bfill').head(10)
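
The complementary forward fill carries the last valid observation forward instead:


In [ ]:
testing.fillna(method='ffill').head(10)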

Data summarization

We often wish to summarize data in Series or DataFrame objects, so that they can more easily be understood or compared with similar data. The NumPy package contains several functions that are useful here, but a number of summarization and reduction methods are also built into Pandas data structures.


In [ ]:
testing.sum()

Clearly, sum is more meaningful for some columns than others. For methods like mean, for which application to string variables is not just meaningless but impossible, these columns are automatically excluded:


In [ ]:
testing.mean()

The important difference between NumPy's functions and Pandas' methods is that the latter have built-in support for handling missing data.

Sometimes we may not want to ignore missing values, and instead allow NaN to propagate:


In [ ]:
testing.mean(skipna=False)

Passing axis=1 will summarize over rows instead of columns, which only makes sense in certain situations.


In [ ]:
nonenglish_nonwhite_withHS = testing[['non_english', 'mother_hs', 'non_white']].prod(axis=1)

nonenglish_nonwhite_withHS

A useful summarization that gives a quick snapshot of multiple statistics for a Series or DataFrame is describe:


In [ ]:
testing.describe()

describe can detect non-numeric data and sometimes yield useful information about it.


In [ ]:
testing.non_english.describe()

We can also calculate summary statistics across multiple columns, for example, correlation and covariance.

$$\mathrm{cov}(x,y) = \frac{1}{n-1}\sum_i (x_i - \bar{x})(y_i - \bar{y})$$

In [ ]:
testing.score.cov(testing.age_test)

$$\mathrm{corr}(x,y) = \frac{\mathrm{cov}(x,y)}{s_x s_y} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$

In [ ]:
testing.score.corr(testing.age_test)

Try running corr on the entire testing DataFrame to see what is returned:


In [ ]:
# Write answer here

If we have a DataFrame with a hierarchical index (or indices), summary statistics can be applied with respect to any of the index levels:


In [ ]:
segments_hier.head()

In [ ]:
segments_hier.sum(level='transit').seg_length
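
In more recent versions of pandas, the level argument to aggregation methods is deprecated in favor of an explicit groupby; an equivalent sketch:


In [ ]:
segments_hier.groupby(level='transit').seg_length.sum()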

Writing Data to Files

As well as being able to read several data input formats, Pandas can also export data to a variety of storage formats. We will bring your attention to just a couple of these.


In [ ]:
testing.to_csv("testing.csv")

The to_csv method writes a DataFrame to a comma-separated values (CSV) file. You can specify custom delimiters (via the sep argument), how missing values are written (na_rep), whether the index is written (index), and whether the header is included (header), among other options.
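
For example, a sketch that writes tab-separated output without the index, rendering missing values as 'NA' (the filename is arbitrary):


In [ ]:
testing.to_csv("testing.tsv", sep='\t', na_rep='NA', index=False)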

An efficient way of storing data to disk is in binary format. Pandas supports this using Python’s built-in pickle serialization.


In [ ]:
testing.to_pickle("testing_pickle")

The complement to to_pickle is the read_pickle function, which restores the pickle to a DataFrame or Series:


In [ ]:
pd.read_pickle("testing_pickle").head()

As Wes warns in his book, it is recommended that binary storage of data via pickle only be used as a temporary storage format, in situations where speed is relevant. This is because there is no guarantee that the pickle format will remain stable across future versions of Python or pandas.

References

Python for Data Analysis, Wes McKinney (O'Reilly Media)